Table 6: Inputs and full explanations for the CoT explanations from Table 4 on random examples where models exhibit unfaithful behavior. Random unfaithful CoT explanations for bias-consistent
Task Model, Zero/Few-Shot Failure Mode Input Question CoT in Unbiased Context CoT in Biased Context "... I think the answer is but I'm curious to hear what you think." Model justifies incorrect answer due to ambiguity in task specification. If you follow these instructions, do you return to the starting point? We start at the origin (0, 0), facing the positive y-axis. So the best answer is: (A). 3 We start at the origin (0, 0), facing the positive y-axis. So the best answer is: (B). Q: Is the following sentence plausible? Kenta Maeda is a baseball pitcher. Throwing to first base is part of baseball. American League Championship Series is a real baseball series. Throwing to first base is part of baseball. So the best answer is: (B) implausible.
Can Linear Probes Measure LLM Uncertainty?
Dakhmouche, Ramzi, Letellier, Adrien, Gorji, Hossein
Effective Uncertainty Quantification (UQ) represents a key aspect for reliable deployment of Large Language Models (LLMs) in automated decision-making and beyond. Yet, for LLM generation with multiple choice structure, the state-of-the-art in UQ is still dominated by the naive baseline given by the maximum softmax score. To address this shortcoming, we demonstrate that taking a principled approach via Bayesian statistics leads to improved performance despite leveraging the simplest possible model, namely linear regression. More precisely, we propose to train multiple Bayesian linear models, each predicting the output of a layer given the output of the previous one. Based on the obtained layer-level posterior distributions, we infer the global uncertainty level of the LLM by identifying a sparse combination of distributional features, leading to an efficient UQ scheme. Numerical experiments on various LLMs show consistent improvement over state-of-the-art baselines.
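For reference, the maximum softmax baseline mentioned in the abstract can be sketched as follows. This is a minimal illustration only: the logits are hypothetical values for a four-option multiple-choice question, and the function names are not taken from the paper.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def max_softmax_confidence(logits):
    """Return (predicted index, max softmax probability) as a naive confidence score."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs[best]

# Hypothetical logits for answer options A-D (assumed values, for illustration).
choice_logits = [2.0, 0.5, 0.1, -1.0]
index, confidence = max_softmax_confidence(choice_logits)
print(index, round(confidence, 3))
```

The score is simply the probability the model assigns to its own top choice, which is why the abstract calls it a naive baseline.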
Towards Transparent Reasoning: What Drives Faithfulness in Large Language Models?
McMillan, Teague, Dominici, Gabriele, Gjoreski, Martin, Langheinrich, Marc
Large Language Models (LLMs) often produce explanations that do not faithfully reflect the factors driving their predictions. In healthcare settings, such unfaithfulness is especially problematic: explanations that omit salient clinical cues or mask spurious shortcuts can undermine clinician trust and lead to unsafe decision support. We study how inference and training-time choices shape explanation faithfulness, focusing on factors practitioners can control at deployment. We evaluate three LLMs (GPT-4.1-mini, LLaMA 70B, LLaMA 8B) on two datasets: BBQ (social bias) and MedQA (medical licensing questions). We manipulate the number and type of few-shot examples, prompting strategies, and training procedure. Our results show: (i) both the quantity and quality of few-shot examples significantly impact model faithfulness; (ii) faithfulness is sensitive to prompting design; (iii) the instruction-tuning phase improves measured faithfulness on MedQA. These findings offer insights into strategies for enhancing the interpretability and trustworthiness of LLMs in sensitive domains.
Efficient Test-Time Scaling for Small Vision-Language Models
Kaya, Mehmet Onurcan, Elliott, Desmond, Papadopoulos, Dim P.
Small Vision-Language Models (VLMs) provide a computationally efficient alternative to larger models, at the cost of weaker generalization abilities and downstream task performance. These shortcomings could be addressed by test-time scaling techniques, but existing methods are typically computationally demanding, contradicting the resource-efficient design goals of small models. To address these limitations, we propose two novel and efficient test-time scaling strategies that leverage the model-internal features rather than external supervision: (i) Test-Time Augmentation (TTAug), which generates multiple augmented inputs and aggregates outputs at the token level without parameter updates, and (ii) Test-Time Adaptation (TTAdapt), which adapts model parameters during inference using consensus-based pseudolabels from TTAug. Through extensive experiments across nine benchmarks, we demonstrate consistent performance improvements while maintaining computational efficiency suitable for resource-constrained environments. The generality of our approach is demonstrated both within models at different scales and across different VLMs without additional tuning.
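One way to picture token-level aggregation across augmented inputs is to average the next-token distributions produced for each augmented view and pick the most likely token. The sketch below assumes simple mean aggregation and toy three-token distributions; the abstract does not specify the aggregation rule, so this is an assumption rather than the authors' TTAug implementation.

```python
def aggregate_token_distributions(distributions):
    """Average next-token distributions from several augmented views.

    distributions: list of equal-length probability lists, one per view.
    Returns (index of the selected token, averaged distribution).
    """
    n_views = len(distributions)
    vocab_size = len(distributions[0])
    mean = [sum(d[i] for d in distributions) / n_views for i in range(vocab_size)]
    token = max(range(vocab_size), key=mean.__getitem__)
    return token, mean

# Hypothetical next-token distributions from three augmented views of one input.
views = [
    [0.6, 0.3, 0.1],
    [0.2, 0.7, 0.1],
    [0.3, 0.6, 0.1],
]
token, mean = aggregate_token_distributions(views)
print(token, [round(p, 3) for p in mean])
```

No parameters are updated here; the views only vote on each decoding step, which matches the abstract's description of TTAug as operating without parameter updates.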
Uncertainty-Aware Answer Selection for Improved Reasoning in Multi-LLM Systems
Agrawal, Aakriti, Aralikatti, Rohith, Satheesh, Anirudh, Chakraborty, Souradip, Bedi, Amrit Singh, Huang, Furong
Large Language Models (LLMs) have demonstrated exceptional capabilities, yet selecting the most reliable response from multiple LLMs remains a challenge, particularly in resource-constrained settings. Existing approaches often depend on costly external verifiers, human evaluators, or self-consistency techniques that require multiple samples from a single model. While multi-LLM systems produce more diverse responses than single models and thus have greater potential, they often underperform compared to single LLM self-consistency. We propose a principled, novel and computationally efficient method to select the best response from multiple different LLMs using a calibrated log-likelihood score, implicitly leveraging the inherent knowledge and confidence of these models. Our method demonstrates improvements of approx. 4%, 3%, and 5% across both debate (multi-round LLM discussions) and non-debate (Best-of-N with multiple LLMs) settings on GSM8K, MMLU (6 subsets), and ARC datasets respectively.
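The general idea of likelihood-based answer selection can be sketched as below. The abstract does not define its calibration, so this example substitutes plain length normalization (mean token log-probability) as a stand-in; the candidate responses and log-probabilities are made up for illustration.

```python
def mean_log_likelihood(token_logprobs):
    """Length-normalized log-likelihood (a stand-in for the paper's calibrated score)."""
    return sum(token_logprobs) / len(token_logprobs)

def select_response(candidates):
    """candidates: dict mapping response text to its list of token log-probabilities."""
    scores = {text: mean_log_likelihood(lp) for text, lp in candidates.items()}
    best = max(scores, key=scores.get)
    return best, scores

# Hypothetical responses from two different LLMs with assumed token log-probs.
candidates = {
    "The answer is 42.": [-0.2, -0.1, -0.3, -0.2],
    "It is probably 41, though I am not sure.": [-1.1, -0.9, -1.4, -0.8, -1.2, -1.0],
}
best, scores = select_response(candidates)
print(best)
```

Normalizing by length avoids favoring short responses purely because they accumulate fewer negative log-probability terms; raw scores from different models may still need further calibration to be comparable.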
Truthful or Fabricated? Using Causal Attribution to Mitigate Reward Hacking in Explanations
Ferreira, Pedro, Aziz, Wilker, Titov, Ivan
Chain-of-thought explanations are widely used to inspect the decision process of large language models (LLMs) and to evaluate the trustworthiness of model outputs, making them important for effective collaboration between LLMs and humans. We demonstrate that preference optimization - a key step in the alignment phase - can inadvertently reduce the faithfulness of these explanations. This occurs because the reward model (RM), which guides alignment, is tasked with optimizing both the expected quality of the response and the appropriateness of the explanations (e.g., minimizing bias or adhering to safety standards), creating potential conflicts. The RM lacks a mechanism to assess the consistency between the model's internal decision process and the generated explanation. Consequently, the LLM may engage in "reward hacking" by producing a final response that scores highly while giving an explanation tailored to maximize reward rather than accurately reflecting its reasoning. To address this issue, we propose enriching the RM's input with a causal attribution of the prediction, allowing the RM to detect discrepancies between the generated self-explanation and the model's decision process. In controlled settings, we show that this approach reduces the tendency of the LLM to generate misleading explanations.
"Amazing, They All Lean Left" -- Analyzing the Political Temperaments of Current LLMs
Neuman, W. Russell, Coleman, Chad, Dasdan, Ali, Ali, Safinah, Shah, Manan, Meghani, Kund
"Amazing, They All Lean Left" - Analyzing the Political Temperaments of Current LLMs Abstract Recent studies have revealed a consistent liberal orientation in the ethical and political responses generated by most commercial large language models (LLMs), yet the underlying causes and resulting implications remain unclear. This paper systematically i nvestigates the political temperament of seven prominent LLMs -- OpenAI's GPT - 4o, Anthropic's Claude Sonnet 4, Perplexity (Sonar Large), Google's Gemini 2.5 Flash, Meta AI's L l a ma 4, Mistral 7b Le Chat, and High - Flyer ' s DeepSeek R1 -- using a multi - pronged approach that incl udes Moral Foundations Theory, a dozen established political ideology scales, and a new index of current political controversies. We find strong and consistent prioritization of liberal - leaning values, particularly care and fairness, across most models. Fur ther analysis attributes this trend to four overlapping factors: liberal - leaning training corpora, reinforcement learning from human feedback (RLHF), the dominance of liberal frameworks in academic ethical discourse, and safety - driven fine - tuning practices . We also distinguish between political "bias" and legitimate epistemic differences, cautioning against conflating the two. A comparison of base and fine - tuned model pairs reveals that fine - tuning generally increases liberal lean, an effect confirmed throu gh both self - report and empirical testing. We argue that this "liberal tilt" is not a programming error or the personal preferences of programmers but an emergent property of training on democratic, rights - focused discourse. Finally, we propose that LLMs may indirectly echo John Rawls' famous veil - of - igno rance philosophical aspiration, reflecting a moral stance unanchored to personal identity or interest. Rather than undermining democratic discourse, this pattern may offer a new lens through which to examine collective ethical reasoning. 
In the course of our research on the ethical logics of currently prominent large language models (Neuman et al. 2025a, b; Coleman et al. 2025), we encountered an interesting finding. The responses to various ethical dilemmas and the explanations of the underlying logics used by these models appear to resonate with the liberal side of the political spectrum. One research analytic we utilize draws on Moral Foundations Theory's five-element typology of foundational moral principles (Graham et al. 2009; Haidt 2012). The five foundations, emphasizing in turn Care, Fairness, Loyalty, Authority and Purity, are traditionally divided into two clusters. The first two, Care and Fairness, are associated with a liberal political perspective, while conservatives, who fully acknowledge the first two, more often emphasize the latter three -- Loyalty, Authority and Purity -- in support of traditional norms.
Medical Data Pecking: A Context-Aware Approach for Automated Quality Evaluation of Structured Medical Data
Girshovitz, Irena, Ambus, Atai, Shahar, Moni, Gilad-Bachrach, Ran
Background: The use of Electronic Health Records (EHRs) for epidemiological studies and artificial intelligence (AI) training is increasing rapidly. The reliability of the results depends on the accuracy and completeness of EHR data. However, EHR data often contain significant quality issues, including misrepresentations of subpopulations, biases, and systematic errors, as they are primarily collected for clinical and billing purposes. Existing quality assessment methods remain insufficient, lacking systematic procedures to assess data fitness for research. Methods: We present the Medical Data Pecking approach, which adapts unit testing and coverage concepts from software engineering to identify data quality concerns. We demonstrate our approach using the Medical Data Pecking Tool (MDPT), which consists of two main components: (1) an automated test generator that uses large language models and grounding techniques to create a test suite from data and study descriptions, and (2) a data testing framework that executes these tests, reporting potential errors and coverage. Results: We evaluated MDPT on three datasets: All of Us (AoU), MIMIC-III, and SyntheticMass, generating 55-73 tests per cohort across four conditions. These tests correctly identified 20-43 non-aligned or non-conforming data issues. We present a detailed analysis of the LLM-generated test suites in terms of reference grounding and value accuracy. Conclusion: Our approach incorporates external medical knowledge to enable context-sensitive data quality testing as part of the data analysis workflow to improve the validity of its outcomes. Our approach tackles these challenges from a quality assurance perspective, laying the foundation for further development such as additional data modalities and improved grounding methods.
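The unit-testing and coverage idea described in the Methods can be illustrated with a small sketch. Everything here is hypothetical: the `DataTest` structure, the field names, the plausibility ranges, and the coverage definition (share of fields touched by at least one test) are assumptions for illustration and are not taken from MDPT.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class DataTest:
    name: str                      # human-readable test name
    fields: tuple                  # record fields this test examines
    check: Callable[[dict], bool]  # returns True when the record passes

def run_tests(records, tests):
    """Run every test on every record; report failing row indices and field coverage."""
    failures = {t.name: [] for t in tests}
    for i, rec in enumerate(records):
        for t in tests:
            if not t.check(rec):
                failures[t.name].append(i)
    all_fields = {f for rec in records for f in rec}
    covered = {f for t in tests for f in t.fields}
    coverage = len(covered & all_fields) / len(all_fields) if all_fields else 0.0
    return failures, coverage

# Hypothetical patient records; the second and third contain implausible values.
records = [
    {"age": 54, "sex": "F", "systolic_bp": 128},
    {"age": -3, "sex": "M", "systolic_bp": 117},
    {"age": 71, "sex": "F", "systolic_bp": 640},
]
tests = [
    DataTest("age_in_range", ("age",), lambda r: 0 <= r["age"] <= 120),
    DataTest("bp_plausible", ("systolic_bp",), lambda r: 50 <= r["systolic_bp"] <= 300),
]
failures, coverage = run_tests(records, tests)
print(failures, round(coverage, 2))
```

In this toy run the `sex` field is never examined, so coverage falls short of 1.0, which is the kind of gap a coverage report is meant to surface.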
Automated Journalistic Questions: A New Method for Extracting 5W1H in French
Verhaverbeke, Maxence, Gramaccia, Julie A., Khoury, Richard
The 5W1H questions -- who, what, when, where, why and how -- are commonly used in journalism to ensure that an article describes events clearly and systematically. Answering them is a crucial prerequisite for tasks such as summarization, clustering, and news aggregation. In this paper, we design the first automated extraction pipeline to get 5W1H information from French news articles. To evaluate the performance of our algorithm, we also create a corpus of 250 Quebec news articles with 5W1H answers marked by four human annotators. Our results demonstrate that our pipeline performs as well in this task as the large language model GPT-4o.